perf: use aligned pointer reads for SparkUnsafeRow field accessors (#46)
martin-augment wants to merge 4 commits into main
Conversation
SparkUnsafeRow field offsets are always 8-byte aligned (the JVM guarantees 8-byte alignment on the base address, bitset_width is a multiple of 8, and each field slot is 8 bytes). This means we can safely use ptr::read() instead of the from_le_bytes(slice) pattern for all typed accesses, avoiding slice creation and try_into overhead.

Move primitive accessor implementations out of the SparkUnsafeObject trait defaults and into each concrete impl via a macro parameterized on the read method:

- SparkUnsafeRow uses ptr::read() (aligned)
- SparkUnsafeArray uses ptr::read_unaligned() (may be unaligned when nested in a row's variable-length region)

Also switch is_null_at/set_not_null_at in SparkUnsafeRow from read_unaligned/write_unaligned to aligned read/write, since the null bitset is always at 8-byte-aligned offsets within the row.
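The alignment argument above can be checked with a small standalone sketch (the `field_addr` helper is illustrative, not the crate's actual API):

```rust
// Hypothetical sketch of the offset arithmetic described above: if the base
// address is 8-byte aligned, the bitset width is a multiple of 8, and each
// field slot is 8 bytes wide, every field address stays 8-byte aligned.
fn field_addr(base: usize, bitset_width: usize, index: usize) -> usize {
    base + bitset_width + index * 8
}

fn main() {
    let base = 0x1000; // assume a JVM-provided, 8-byte-aligned base address
    for bitset_width in [8usize, 16, 24] {
        for index in 0..4 {
            let addr = field_addr(base, bitset_width, index);
            assert_eq!(addr % 8, 0, "field address must stay 8-byte aligned");
        }
    }
    println!("all field addresses aligned");
}
```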
The test_append_null_struct_field_to_struct_builder test used a plain [u8; 16] stack buffer with no alignment guarantee. Since is_null_at performs aligned i64 reads, Miri flags this as undefined behavior when the buffer lands at a non-8-byte-aligned address. Wrap the buffer in a #[repr(align(8))] struct to match the alignment that real Spark UnsafeRow data always has from JVM memory.
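A minimal sketch of that test-buffer fix (the wrapper name is illustrative):

```rust
// Wrap a byte buffer in a #[repr(align(8))] struct so its start address is
// guaranteed to be 8-byte aligned, matching real JVM-allocated UnsafeRow data.
#[repr(align(8))]
struct AlignedBuf16([u8; 16]);

fn main() {
    let buf = AlignedBuf16([0u8; 16]);
    let addr = buf.0.as_ptr() as usize;
    // An aligned i64 read at this address is now well-defined.
    assert_eq!(addr % 8, 0);
    println!("buffer aligned at {addr:#x}");
}
```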
Walkthrough: The changes modify Spark unsafe object accessors by introducing macro-based primitive getter implementations.
Summary of Changes (Gemini Code Assist): This pull request refactors how primitive fields are accessed within Spark UnsafeRow and UnsafeArray structures in Rust, improving performance and correctness by explicitly handling memory alignment. It introduces a macro to generate accessor methods that use either aligned or unaligned pointer reads.
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a performance optimization by using aligned pointer reads for SparkUnsafeRow field accessors, and unaligned reads for SparkUnsafeArray. A macro is used to generate the accessor methods, which is a clean approach to handle the different alignment guarantees. While the logic for choosing between aligned and unaligned reads is sound, the implementation of the primitive accessors overlooks data endianness. This introduces a critical correctness bug that can lead to silent data corruption on big-endian architectures. I have provided a detailed comment and a code suggestion to fix this issue.
macro_rules! impl_primitive_accessors {
    ($read_method:ident) => {
        #[inline]
        fn get_boolean(&self, index: usize) -> bool {
            let addr = self.get_element_offset(index, 1);
            debug_assert!(
                !addr.is_null(),
                "get_boolean: null pointer at index {index}"
            );
            // SAFETY: addr points to valid element data within the row/array region.
            unsafe { *addr != 0 }
        }

        #[inline]
        fn get_byte(&self, index: usize) -> i8 {
            let addr = self.get_element_offset(index, 1);
            debug_assert!(!addr.is_null(), "get_byte: null pointer at index {index}");
            // SAFETY: addr points to valid element data (1 byte) within the row/array region.
            unsafe { *(addr as *const i8) }
        }

        #[inline]
        fn get_short(&self, index: usize) -> i16 {
            let addr = self.get_element_offset(index, 2) as *const i16;
            debug_assert!(!addr.is_null(), "get_short: null pointer at index {index}");
            // SAFETY: addr points to valid element data (2 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_int(&self, index: usize) -> i32 {
            let addr = self.get_element_offset(index, 4) as *const i32;
            debug_assert!(!addr.is_null(), "get_int: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_long(&self, index: usize) -> i64 {
            let addr = self.get_element_offset(index, 8) as *const i64;
            debug_assert!(!addr.is_null(), "get_long: null pointer at index {index}");
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_float(&self, index: usize) -> f32 {
            let addr = self.get_element_offset(index, 4) as *const f32;
            debug_assert!(!addr.is_null(), "get_float: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_double(&self, index: usize) -> f64 {
            let addr = self.get_element_offset(index, 8) as *const f64;
            debug_assert!(!addr.is_null(), "get_double: null pointer at index {index}");
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_date(&self, index: usize) -> i32 {
            let addr = self.get_element_offset(index, 4) as *const i32;
            debug_assert!(!addr.is_null(), "get_date: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_timestamp(&self, index: usize) -> i64 {
            let addr = self.get_element_offset(index, 8) as *const i64;
            debug_assert!(
                !addr.is_null(),
                "get_timestamp: null pointer at index {index}"
            );
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }
    };
}
The current implementation of primitive accessors has a correctness issue on big-endian systems. ptr::read() and ptr::read_unaligned() interpret bytes using the native endianness of the CPU. However, Spark's Unsafe format is little-endian. This will lead to incorrect data being read on big-endian architectures, causing silent data corruption.
To ensure portability and correctness, you should explicitly handle the little-endian format. For integer types, you can use T::from_le(). For floating-point types, you should read the data as an integer of the same size, convert its endianness, and then convert the bits to a float (e.g., f32::from_bits(u32::from_le(...))).
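The suggested pattern can be exercised on its own; on a little-endian target `from_le` compiles to a no-op, so the portability fix costs nothing there (standalone sketch, not the PR's code):

```rust
// Standalone sketch of the endianness-safe pattern from the review: read the
// raw bits with a pointer read, then normalize from Spark's little-endian
// layout to the native representation.
fn read_i32_le(bytes: &[u8; 4]) -> i32 {
    let raw = unsafe { (bytes.as_ptr() as *const i32).read_unaligned() };
    i32::from_le(raw) // no-op on little-endian targets, byte swap on big-endian
}

fn read_f32_le(bytes: &[u8; 4]) -> f32 {
    let raw = unsafe { (bytes.as_ptr() as *const u32).read_unaligned() };
    f32::from_bits(u32::from_le(raw)) // convert bits first, then transmute
}

fn main() {
    assert_eq!(read_i32_le(&[0x04, 0x03, 0x02, 0x01]), 0x0102_0304);
    assert_eq!(read_f32_le(&1.5f32.to_le_bytes()), 1.5);
    println!("endianness-safe reads ok");
}
```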
macro_rules! impl_primitive_accessors {
    ($read_method:ident) => {
        #[inline]
        fn get_boolean(&self, index: usize) -> bool {
            let addr = self.get_element_offset(index, 1);
            debug_assert!(
                !addr.is_null(),
                "get_boolean: null pointer at index {index}"
            );
            // SAFETY: addr points to valid element data within the row/array region.
            unsafe { *addr != 0 }
        }

        #[inline]
        fn get_byte(&self, index: usize) -> i8 {
            let addr = self.get_element_offset(index, 1);
            debug_assert!(!addr.is_null(), "get_byte: null pointer at index {index}");
            // SAFETY: addr points to valid element data (1 byte) within the row/array region.
            unsafe { *(addr as *const i8) }
        }

        #[inline]
        fn get_short(&self, index: usize) -> i16 {
            let addr = self.get_element_offset(index, 2) as *const i16;
            debug_assert!(!addr.is_null(), "get_short: null pointer at index {index}");
            // SAFETY: addr points to valid element data (2 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i16::from_le(addr.$read_method()) }
        }

        #[inline]
        fn get_int(&self, index: usize) -> i32 {
            let addr = self.get_element_offset(index, 4) as *const i32;
            debug_assert!(!addr.is_null(), "get_int: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i32::from_le(addr.$read_method()) }
        }

        #[inline]
        fn get_long(&self, index: usize) -> i64 {
            let addr = self.get_element_offset(index, 8) as *const i64;
            debug_assert!(!addr.is_null(), "get_long: null pointer at index {index}");
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i64::from_le(addr.$read_method()) }
        }

        #[inline]
        fn get_float(&self, index: usize) -> f32 {
            let addr = self.get_element_offset(index, 4) as *const u32;
            debug_assert!(!addr.is_null(), "get_float: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian. Read as u32, convert, then transmute.
            unsafe { f32::from_bits(u32::from_le(addr.$read_method())) }
        }

        #[inline]
        fn get_double(&self, index: usize) -> f64 {
            let addr = self.get_element_offset(index, 8) as *const u64;
            debug_assert!(!addr.is_null(), "get_double: null pointer at index {index}");
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian. Read as u64, convert, then transmute.
            unsafe { f64::from_bits(u64::from_le(addr.$read_method())) }
        }

        #[inline]
        fn get_date(&self, index: usize) -> i32 {
            let addr = self.get_element_offset(index, 4) as *const i32;
            debug_assert!(!addr.is_null(), "get_date: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i32::from_le(addr.$read_method()) }
        }

        #[inline]
        fn get_timestamp(&self, index: usize) -> i64 {
            let addr = self.get_element_offset(index, 8) as *const i64;
            debug_assert!(
                !addr.is_null(),
                "get_timestamp: null pointer at index {index}"
            );
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i64::from_le(addr.$read_method()) }
        }
    };
}
Cursor Bugbot has reviewed your changes and found 1 potential issue.
/// Returns double value at the given index of the object.
#[inline]
fn get_double(&self, index: usize) -> f64 {
    let addr = self.get_element_offset(index, 8);
Aligned read on potentially unaligned nested struct from array
High Severity
The trait method get_struct returns a SparkUnsafeRow when called on a SparkUnsafeArray. Since SparkUnsafeRow uses aligned ptr::read() for all typed accesses, but the PR's own documentation states that SparkUnsafeArray base addresses "may be unaligned when nested within a row's variable-length region," the resulting SparkUnsafeRow from array.get_struct(idx, ...) would inherit that misalignment. Calling is_null_at, get_int, get_long, etc. on such a row invokes aligned ptr::read() on potentially unaligned memory, which is undefined behavior. This path is exercised in list.rs when arrays contain struct elements.
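The hazard can be reproduced in miniature: a nested base at an odd offset inside an aligned buffer makes aligned `i64` reads UB, while `read_unaligned` stays well-defined (standalone sketch, not the crate's API):

```rust
// Miniature reproduction of the hazard: a "nested row" base at offset 4 inside
// an 8-byte-aligned buffer is only 4-byte aligned, so an aligned ptr::read()
// of an i64 there would be UB; read_unaligned() stays well-defined.
#[repr(align(8))]
struct Buf([u8; 24]);

fn main() {
    let mut buf = Buf([0u8; 24]);
    // Write a value into a variable-length region starting at offset 4.
    buf.0[4..12].copy_from_slice(&42i64.to_le_bytes());

    let nested_base = unsafe { buf.0.as_ptr().add(4) };
    // Misaligned for i64: ptr::read() here would be undefined behavior.
    assert_ne!(nested_base as usize % 8, 0);
    // read_unaligned() is the sound choice for such a base.
    let v = unsafe { (nested_base as *const i64).read_unaligned() };
    assert_eq!(i64::from_le(v), 42);
    println!("unaligned nested read ok");
}
```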
Code Review: perf: use aligned pointer reads for SparkUnsafeRow field accessors

Overall this is a well-motivated and well-documented performance optimization. The alignment reasoning is sound and the macro approach cleanly handles the two different cases.

Correctness: Endianness assumption

This is the most important issue. The old code used Spark's little-endian from_le_bytes decoding, whereas ptr::read() uses native endianness. If Comet only targets little-endian architectures this is fine, but it should be documented. Consider a module-level comment noting the LE assumption, or a compile-time target_endian check.

Minor: Single-byte reads have no alignment requirement, so ignoring $read_method in get_boolean/get_byte is correct.

Minor: Alignment claim wording

The comment "JVM guarantee" for 8-byte base-address alignment is accurate in practice for HotSpot on x86_64, but the Java Language Specification does not formally require it.

Test fix is correct

The #[repr(align(8))] wrapper restores the 8-byte alignment invariant that is_null_at's aligned reads rely on.

Summary

The endianness point is the one I would ask to be addressed before merging, even if only as a module-level comment or a compile-time guard.
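If little-endian-only support is the intended contract, the compile-time check the reviewer suggests is a one-liner:

```rust
// Fail the build on big-endian targets rather than silently corrupting data.
#[cfg(target_endian = "big")]
compile_error!("this crate assumes the little-endian Spark Unsafe layout");

fn main() {
    // On little-endian targets this compiles, and from_le is the identity.
    assert_eq!(i32::from_le(7), 7);
    println!("little-endian target confirmed");
}
```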
🤖 Augment PR Summary

Summary: Improves Spark UnsafeRow/UnsafeArray primitive field access by switching from byte-slice decoding to typed pointer reads, using aligned reads where layout guarantees alignment.

Changes:
Technical Notes: Adds alignment rationale to docs/comments to justify the aligned vs unaligned access strategy.
let addr = self.get_element_offset(index, 4) as *const i32;
debug_assert!(!addr.is_null(), "get_int: null pointer at index {index}");
// SAFETY: addr points to valid element data (4 bytes) within the row/array region.
unsafe { addr.$read_method() }
ptr::$read_method() reads in native endianness, but Spark UnsafeRow/UnsafeArray encodes primitives as little-endian (e.g., SparkUnsafeArray::new uses from_le_bytes). If this crate ever targets big-endian platforms, these accessors would decode incorrectly unless you explicitly guard/document little-endian-only support.
Severity: medium
  let mask: i64 = 1i64 << (index & 0x3f);
  let word_offset = (self.row_addr + (((index >> 6) as i64) << 3)) as *const i64;
- let word: i64 = word_offset.read_unaligned();
+ let word: i64 = word_offset.read();
Switching to word_offset.read() assumes row_addr is 8-byte aligned, but point_to_slice can point at arbitrary &[u8] (and Vec<u8> is not guaranteed 8-byte aligned by its type). If an unaligned slice is ever passed in, this becomes UB; it may be worth enforcing/documenting the alignment precondition at the API boundary.
Severity: high
Other Locations
native/core/src/execution/shuffle/spark_unsafe/row.rs:369
native/core/src/execution/shuffle/spark_unsafe/row.rs:370
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@native/core/src/execution/shuffle/spark_unsafe/row.rs`:
- Around line 280-283: The safe constructor
SparkUnsafeRow::point_to_slice(&[u8]) currently accepts unaligned slices but
downstream helpers generated by impl_primitive_accessors!(read) and methods
is_null_at() / set_not_null_at() use aligned ptr::read()/write(), so add a
runtime alignment check in point_to_slice that asserts the row base pointer is
8-byte aligned (e.g., check (row_addr as usize) % 8 == 0 or use
is_aligned_to(8)) and panic with a clear message if not; this enforces the
8-byte alignment invariant at the safe entry point and prevents undefined
behavior when benchmarks pass unaligned Vec<u8> buffers.
📒 Files selected for processing (2)
native/core/src/execution/shuffle/spark_unsafe/list.rs
native/core/src/execution/shuffle/spark_unsafe/row.rs
// SparkUnsafeRow field offsets are always 8-byte aligned: the base address is 8-byte
// aligned (JVM guarantee), bitset_width is a multiple of 8, and each field slot is
// 8 bytes. This means we can safely use aligned ptr::read() for all typed accesses.
impl_primitive_accessors!(read);
Enforce the 8-byte row alignment invariant at the safe entry point.
impl_primitive_accessors!(read) and the aligned ptr::read()/ptr::write() calls in is_null_at() and set_not_null_at() require row_addr to be 8-byte aligned, but SparkUnsafeRow::point_to_slice(&[u8]) accepts unaligned slices. The benchmark passes Vec<u8> buffers (alignment = 1) to point_to_slice(), which later causes undefined behavior when aligned pointer operations dereference misaligned addresses.
Add an alignment assertion at the safe boundary to catch this at runtime:
Suggested guard
 pub fn point_to_slice(&mut self, slice: &[u8]) {
+    assert!(
+        slice.is_empty()
+            || (slice.as_ptr() as usize) % std::mem::align_of::<i64>() == 0,
+        "SparkUnsafeRow::point_to_slice requires an 8-byte aligned buffer"
+    );
     self.row_addr = slice.as_ptr() as i64;
     self.row_size = slice.len() as i32;
 }

The test at line 1695 is safe because it uses #[repr(align(8))], but the benchmark at line 236 passes unaligned Vec<u8> slices and will panic or exhibit undefined behavior without this guard.
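For the benchmark side, one way to satisfy such an assertion is to back each row with `u64` storage, which Rust guarantees is 8-byte aligned (a sketch under that assumption; `aligned_row_buf` is an illustrative helper, not the benchmark's actual code):

```rust
// Allocate row storage as Vec<u64> (8-byte aligned by the type's layout
// guarantee) and view it as bytes, instead of using a Vec<u8> whose start
// address may land anywhere.
fn aligned_row_buf(len_bytes: usize) -> (Vec<u64>, *const u8) {
    let words = len_bytes.div_ceil(8);
    let storage = vec![0u64; words];
    let ptr = storage.as_ptr() as *const u8;
    (storage, ptr) // keep storage alive for as long as ptr is used
}

fn main() {
    let (storage, ptr) = aligned_row_buf(100);
    assert_eq!(ptr as usize % 8, 0, "row base must be 8-byte aligned");
    assert!(storage.len() * 8 >= 100);
    println!("aligned benchmark buffer ready");
}
```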


3670: To review by AI